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Abstract 

Many  CFD  (computational  fluid  dynamics)  and  other  scientific  applications  can  be 
partitioned  into  subproblems.  However,  in  general  the  partitioned  subproblems  are 
very  large.  They  demand  high  performance  computing  power  themselves,  and  the  so¬ 
lutions  of  the  subproblems  have  to  be  combined  at  each  time  step.  In  this  paper,  the 
cube-connect  cube  (CCCube)  architecture  is  studied.  The  CCCube  architecture  is  an 
extended  hypercube  structure  with  each  node  represented  as  a  cube.  It  requires  fewer 
physical  links  between  nodes  than  the  hypercube,  and  provides  the  same  communica¬ 
tion  support  as  the  hypercube  does  on  many  applications.  The  reduced  physical  links 
can  be  used  to  enhance  the  bandwidth  of  the  remanding  links  and,  therefore,  enhance 
the  overall  performance.  The  concept  and  the  method  to  obtain  optimal  CCCubes, 
which  are  the  CCCubes  with  a  minimum  number  of  links  under  a  given  total  number 
of  nodes,  are  proposed.  The  superiority  of  optimal  CCCubes  over  standard  hypercubes 
has  also  been  shown  in  terms  of  the  link  usage  in  the  embedding  of  a  binomial  tree. 
A  useful  computation  structure  based  on  a  semi-binomial  tree  for  divide-and-conquer 
type  of  parallel  algorithms  has  been  identified.  We  have  shown  that  this  structure  can 
be  implemented  in  optimal  CCCubes  without  performance  degradation  compared  with 
regular  hypercubes.  The  result  presented  in  this  paper  should  provide  a  useful  approach 
to  design  of  scientific  parallel  computers. 
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tract  NASl-19480  while  the  first  author  was  in  residence  at  the  Institute  for  Computer  Applications  in  Science  and 
Engineering  (ICASE),  NASA  Langley  Research  Center,  Hampton,  VA  23681-0001. 
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1  Introduction 


Rapidly  advancing  technology  has  made  it  possible  for  a  large  number  of  processors  to  be  intercon¬ 
nected  to  form  a  single  multiprocessor  system.  In  recent  years,  the  multiprocessor  approach  has 
been  shown  to  be  the  most  straightforward  and  cost-effective  way  for  achieving  high  performance. 
However,  the  way  in  which  processors,  memory  modules,  and  switches  should  be  interconnected 
to  form  an  efficient  architecture  remains  a  research  issue.  Parallel  computers  have  been  built  with 
a  variety  of  architectures.  One  of  the  popular  parallel  architectures  is  the  hypercube  architecture 
[16],  also  known  as  the  binary  n-cube,  which  contains  2n  processors,  each  of  which  is  connected 
by  fixed  communication  links  to  n  other  nodes.  The  value  n  is  known  as  the  dimension  of  the 
hypercube.  In  a  hypercube  structure  two  nodes  are  connected  if  and  only  if  their  addresses  differ 
in  one  and  only  one  bit. 

The  hypercube  structure  has  many  desirable  properties.  It  is  symmetric.  Any  n  dimensional 
cube  can  be  divided  into  two  n  -  1  dimensional  cubes.  Many  other  topologies,  such  as  ring,  mesh, 
and  tree,  can  be  mapped  into  the  hypercube  topology.  It  is  rich  in  connection,  a  message  can  be 
transferred  from  one  node  to  all  the  other  nodes  in  a  total  of  n  steps  in  an  n-cube.  Extensive 
research  efforts  have  been  focused  on  hypercube  design  aspects  and  hypercube  applications.  Most 
of  the  first  generation  and  second  generation  distributed-memory  multiprocessors  are  based  on 
hypercube  architecture.  Examples  of  these  commercial  products  include  FPS’s  T  series,  Ncube’s 
nCUBE,  Ametek’s  S/14,  Intel’s  iPSC,  and  Thinking  Machine’s  Connection  Machine,  which  is  a 
hypercube  interconnected  bit-serial  SIMD  machine. 

Efforts  have  also  been  made  to  vary  the  hypercube  topology  to  obtain  better  interconnection 
networks.  Many  variations  of  the  hypercube  topology,  such  as  twisted  hypercubes  [5],  enhanced 
hypercubes  [21],  extended  hypercubes  [11],  bridged  hypercubes  [3],  incomplete  hypercubes  [10]  and 
Fibonacci  cubes  [7],  balanced  hypercubes  [8]  and  folded  hypercubes  [4],  etc.,  have  been  proposed. 
These  new  architectures  keep  the  desirable  properties  of  hypercubes,  and  incorporate  new  features 
that  are  more  suitable  for  some  specific  applications  and  objectives.  The  Cube-Connected  Cube 
(CCCube)  structure  [23]  is  one  of  the  variations  of  hypercube  topology.  A  CCCube  is  an  extended 
hypercube  structure  with  each  node  represented  as  a  cube.  With  the  same  number  of  processors,  A 
CCCube  requires  few  physical  links  than  a  comparable  hypercube  and  provides  the  same  support 
as  the  hypercube  does  in  many  ways.  The  routing  and  broadcasting  algorithms  in  the  CCCube 
have  been  discussed  in  several  previous  studies  [6],  [23]. 

The  parallel  divide-and-conquer  paradigm  is  a  computation  paradigm  which  partitions  a  single 
complex  problem  into  a  set  of  subproblems,  which  are  further  divided  until  every  independent 
subproblem  has  been  broken  up  sufficiently.  After  all  the  subproblems  have  been  solved,  data  (or 
results)  arc  collected.  The  above  process  can  be  represented  by  a  binomial  tree  structure.  Lo  ct  al.  — 
[12]  have  shown  that  the  binomial  tree  is  an  ideal  computation  structure  for  parallel  dividc-and-  — - 
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conquer  algorithms,  and  is  superior  to  the  classic  full  binary  tree  structure  with  respect  to  speedup 
and  efficiency.  Since  a  large  number  of  parallel  algorithms  are  divided-and-conquer  in  nature,  the 
ability  to  embed  (or  map)  a  binomial  tree  into  a  network  can  be  considered  an  important  measure 
of  the  network. 

This  paper  studies  the  capability  of  embedding  a  binomial  tree  in  a  CCCube.  We  first  prove 
that  an  i-level  binomial  tree  can  be  embedded  in  any  (m,n)-CCCube,  where  m  is  the  dimension 
of  the  outer  cube  and  n  is  the  dimension  of  the  inner  cube,  provided  that  m  +  n  >  i.  With  the 
objective  of  embedding  a  binomial  tree  in  a  CCCube  using  as  few  links  as  possible,  we  define  an 
optimal  CCCube  as  being  one  with  the  minimum  number  of  links  for  a  given  number  of  processors. 
Reducing  the  number  of  links  will  lead  to  a  higher  bandwidth  of  the  remanding  links,  lower  net  work 
contention,  and  thus  better  overall  performance.  The  selection  of  an  optimal  CCCube  for  a  given 
binomial  tree  is  also  provided  in  this  paper.  Comparison  is  made  between  CCCubes  and  standard 
hypercubes  in  terms  of  the  link  usage  in  the  embedding  of  binomial  trees.  We  also  identify  a  class 
of  parallel  algorithms  that  is  best  suited  for  optimal  CCCube  structures.  This  class  of  parallel 
algorithms  is  based  on  the  semi-binomial  tree  proposed  in  this  paper. 

This  paper  is  organized  as  follows:  Section  2  discusses  embedding  binomial  trees  in  (’('Cubes. 
The  determination  of  optimal  cube-connected  cubes  is  discussed  in  Section  3.  Section  4  identifies 
a  class  of  parallel  algorithms  based  on  the  semi-binomial  tree  structure.  A  parallel  merge  sorting 
example  is  used  to  illustrate  how  to  run  the  proposed  algorithm  on  optimal  CCCubes.  Section  5 
presents  conclusions.  A  comprehensive  comparison  of  CCCubes  with  other  cube-based  systems  has 
been  done  in  [22],  and  a  comparison  of  CCCubes  with  Cube-Connected  Cycles  (CCC)  [15]  can  be 
found  in  [13].  The  use  of  CCCubes  in  other  applications  can  be  found  in  [23]  and  [24]. 

2  Embedding  of  Binomial  Trees  in  CCCubes 

An  (m,  n)  cube-connected  cube  [23],  or  (m,  n)-CCC’ube,  is  defined  as  an  m-dimensional  hypercube 
(outer-cube)  with  each  node  in  the  hypercube  being  an  n-dimensional  hvpercube  (inner-cube). 
Assume  that  grngm~i...g\lnln~l...l-i  is  the  binary  address  associated  with  each  of  the  2m+"  nodes 
in  an  (m, n)-CCCube,  where  gm9m-i---9i  is  the  global  address  and  /„/„_]  .../j  is  the  local  address. 
The  least  significant  bit,  g\ ,  of  the  global  address  will  be  referred  as  global  dimension  1.  and  so 
on.  Similarly,  the  least  significant  bit  of  the  local  address  designates  local  dimension  1.  and  so  on. 
There  are  m  global  dimensions  and  n  local  dimensions  in  an  (m, n)-CCCube.  More  formally,  we 
have  the  following  recursive  definition  of  an  (m,7t)-CCCube: 

Definition  1  •  A  (0,  n)-CCCu6e  is  an  n-dimcnsional  hypcrcnbe  Qn,  with  one  node  in  (O.n)- 

CC'C'ube  a  designated  port  node. 
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(3, 2)-CCCube 


Figure  1.  Constructions  of  (3,2)  CCCube 

•  Suppose  G  and  G'  are  disjoint  (m  -  1  ,n)-CCCubes  for  m  >  1.  Then  the  graph  obtained  by 
adding  edges  between  all  the  port  nodes  in  G  and  the  corresponding  port  nodes  in  G'  is  an 
{m,n)-CCCube.  All  the  port  nodes  in  G  and  G'  are  the  port  nodes  in  this  (m,  n)~CCCube. 

Figure  l  illustrates  the  rule  for  building  a  (3, 2)-CCCube.  Basic  properties  of  a  CCCube  have 
been  studied  in  [23],  as  well  as  routing  and  broadcasting  algorithms. 

The  cube-connected  cube  architecture  has  many  desirable  properties.  If  we  view  the  inner- 
cubes  as  nodes,  then  the  outer-cube  forms  the  hypercube  architecture.  Therefore,  the  architecture 
is  symmetric,  rich  in  connection,  and  can  be  partitioned  into  subcubes.  The  nodes  in  each  inner- 
cube  provide  much  higher  computation  power  than  a  single  processor.  This  two-level  hypercube 
architecture  fits  many  scientific  applications  well.  For  instance,  the  3-D  turbulence  simulation 
code  CDNS  (Compressible  Direct  Simulation  of  Navier- Stokes)  [17],  which  is  used  in  and  out  of 
the  NASA  Langley  Research  Center  for  basic  research  in  the  physics  of  compressible  homogeneous 
turbulence,  calculates  spatial  derivatives  with  a  sixth-order  compact  scheme.  The  compact  scheme 
requires  solutions  of  a  large  sparse  system  with  multiple  right  sides,  where  each  right  side  can  be 
solved  on  an  inner-cube  concurrently,  and  then  the  solutions  of  each  inner-cube  can  be  combined 
through  the  outer-cubes  in  the  next  time  step.  In  general,  the  two-level  computation,  or  partition 
computational  paradigm,  is  applicable  to  any  simulation  based  on  the  compact  scheme.  It  is  also 
applicable  to  any  simulation  code  based  on  the  alternating  direction  implicit  (ADI)  method  and 
the  fast  Poisson’s  solvers  [17]. 
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CCCubes  also  support  any  program  paradigm  supported  by  hypercubes.  For  example,  the  to¬ 
tal  data-exchange  communication  [19],  the  data-gathering  communication,  and  the  cla,a-scattering 
communication  [18]  all  requires  log(n)  communication  steps  on  an  n-dimension  hypercube,  there¬ 
fore,  they  require  no  more  than  log(7«)  -i-  log(7i)  communication  steps  on  a  (m,  n)-CCCube.  In 
many  cases,  the  CCCube  architecture  provides  better  support  than  a  two-level  hypercube.  As  we 
mentioned  in  Section  1,  the  divide-and-conquer  paradigm  is  one  of  the  dominating  computation 
paradigms  in  parallel  processing.  The  partition  paradigm  given  above  can  be  seen  as  a  special  case 
of  the  divide-and-conquer  paradigm.  In  this  section,  we  prove  that  CCCube  provides  hypercube-like 
support  for  the  divide-and-conquer  paradigm. 

One  of  the  most  conventional  graph  representations  of  divide-and-conquer  algorithms  is  the 
binomial  tree  [1],  More  specifically,  an  i-level  binomial  tree.  B,,  can  be  recursively  defined  as 
follows: 

Definition  2 

•  Any  tree  consisting  of  a  single  node  is  a  Bo  tree. 

•  Suppose  that  T  and  T'  arc  disjoint  Bt_ j  trees,  for  i  >  1.  Then  the  tree  obtained  by  adding  an 
edge  to  make  the  root  of  T  become  the  leftmost  offspring  of  the  root  V  is  a  Bx  tree. 

Figure  2  shows  the  construction  of  high  level  binomial  trees  from  low  level  binomial  trees.  Lo  et 
al  [12]  show  the  binomial  tree  structure  as  an  ideal  computation  structure  for  parallel  divide-and- 
conquer  algorithms,  and  show  its  superiority  to  the  classic  full  binary  tree  structure,  with  respect  to 
speedup  and  efficiency.  Therefore  it  is  important  to  study  the  embedding  of  a  binomial  tree  into  a 
CCCube.  In  general,  the  embedding  problem  on  cube-based  systems  [2],  a  restricted  version  of  the 
mapping  problem  [14],  is  the  problem  of  mapping  a  particular  graph  structure  G  to  a  cube-based 
system  G' .  The  goal  of  the  mapping  problem  is  to  find  a  mapping  that  minimizes  the  length  of 
the  path  between  communication  processes  in  this  graph  structure  G.  Reducing  the  length  of  the 
communication  path  is  important.  Even  with  the  new  routing  schemes,  such  as  wormhole  routing 
or  circuit  switching,  shortening  the  path  length  will  reduce  the  network  contention  and  achieve 
better  performance  [20].  Dilation  and  congestion  are  two  measures  used  to  measure  the  quality 
of  an  embedding,  where  dilation  is  the  maximum  length  in  G  of  the  image  of  an  edge  of  G  and 
congestion  of  an  edge  of  G'  is  the  number  of  images  of  edges  of  G  that  pass  through  it. 

Theorem  1  An  i-level  binomial  tree  can  be  embedded  with  unit  dilation  in  any  (in.  v)  (’(’Cube, 
provided  that  m  +  n  >  i.  In  addition,  the  root  node  of  this  i-level  binomial  tree  ran  be  mapped  to 
any  port  node  in  the  ( m,n)-CCCube . 

Proof:  We  only  need  to  show  that  an  7-level  binomial  tree  ran  be  embedded  in  any  (in.n) 

CCCube,  where  m  +  n  =  i.  We  prove  it  by  using  induction  on  m.  When  m  =  0,  any  (0, /)-(’( 'Cube 
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Figure  2.  Binomial  trees 

is  an  t-dimensional  hypercube  Qi.  Therefore,  an  t-level  binomial  tree  can  be  embedded  in  this 
(0,  n)-CCCube  [16]  and  the  root  node  will  be  mapped  to  the  only  port  node  in  the  (0,  i)-CCCube. 
Suppose  when  m  <  i  -  1,  a  i-level  binomial  tree  with  i  >  m  can  be  embedded  in  any  ( m.n )* 
CCCube,  such  that  m  +  n  =  i,  and  its  root  node  to  one  of  the  port  nodes.  When  m  —  i,  a  i-level 
binomial  tree,  5,,  with  i  >  m,:  can  be  decomposed  into  two  disjoint  ( i  -  l)-level  binomial  trees: 

and  B\_x  with  an  edge  connecting  two  root  nodes  of  these  two  trees.  Also,  any  (m,  7i)-CCCube 
can  be  decomposed  into  two  (m—  1,  n)-CCCubes,  G  and  G',  with  edges  connecting  the  port  nodes 
of  G  and  G'.  Based  on  the  assumption,  can  be  embedded  in  G  with  the  root  node  assigned 
to  any  one  of  the  port  nodes,  say  a,  in  G.  Similarly,  5,_i  can  be  embedded  in  G'  with  the  root 
node  assigned  to  the  port  node  a',  the  matching  node  of  a  in  G.  Since  a  and  a'  are  connected  in 
the  (m,  7i)-CCCube,  the  edge  that  connects  the  root  node  of  f?;_i  and  B-_:  can  be  mapped  to  the 
edge  that  connects  a  and  a'.  □ 


3  Finding  the  Optimal  Cube-Connected  Cube 

Let  m,  n  be  the  dimension  of  the  outer-cube  and  the  inner-cube,  respectively.  The  following  theorem 
determines  how  to  choose  m  (or  n)  based  on  a  constant  c  =  m  4- »,  i.e.,  a  fixed  number  of  nodes, 

'We  don’t  need  to  consider  the  case  where  i  =  m,  since  the  corresponding  (»,  0)-CCCube  is  an  i-dimension&i 
hypercube. 
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such  that  the  (m,n)-CCCube  has  a  minimum  number  of  links. 

Note  that  in  an  (m,  w)-CCCube,  the  total  number  of  nodes  jV|  =  2TO+n  =  2C  and  the  total 
number  of  links  |£j  =  c  •  2C_1  —  m-(2  K  We  represent  c  =  2k  +  I,  where  0  <  /  <  2*-1,  that  is, 
k  =  [logcj  and  l  =  c  -  2 L,08 CJ . 


Theorem  2  To  obtain  an  ( m,n)-CCCube  with  a  minimum  number  of  links,  the  se'ection  of  in, 
under  a  given  constard  c  =  m  +  n,  where  c  =  2k  +  Z,0  <  l  <  2k  -  1,  is  as  follows: 

1.  If  l  >  k  —  2  then  m  =  2k  +  l  —  k  —  1,  namely  m  =  c  —  [logcj  -  1,  and  the  minimum  number 
of  links  is  \E\  =  2c~llo*cJ-1(c  +  ([logcj  +  l)2l'°8cJ+1  -  [logcj). 

2.  If  l  <  k  ~  2  then  m  =  2k  +  l  -  k  and  the  minimum  number  of  links  is  \E\  =  2c-llogcJ(c  + 
[logcJ2llo«cJ  _  [logcj  +  1). 

Proof:  When  c  =  m  +  n  is  fixed,  to  obtain  the  minimum  value  c  •  2C_1  -  —  ^-^-2  -1  of  the 

number  of  links  in  an  (m,  n)-CCCube,  with  a  given  constant  c  =  m  +  n,  is  equivalent  to  obtaining 
the  maximum  value  of  /(m)  =  m(2c-2m).  Note  that  /(m+l)-/(m)  =  (m+l)(2c-2m+1)-m(2c- 
2m)  =  2C  —  2m(m  +  2)  is  monotone  decreasing.  Therefore,  at  p  =  max{m  +  l|/(m  +  1)  -  /(m)  > 
0}»/(m)  reaches  its  maximum  value,  f(p).  Also,  if  f(p)  -  f(p  -  1)  =  0,  both  f(p)  and  f(p  -  1) 
have  the  maximum  value. 

To  find  p  we  first  determine  its  range  by  considering  the  following  two  cases: 

1.  If  m  =  2k  +  l  —  k  +  1,  then 


/(m+  l)-/(m) 


therefore,  p  <2k  +  l  —  k  +  1. 
2.  If  m  =  2k  +  l  -  k  -  1,  then 


2C  -  22k+l-k+' {2k  +  l  -  k  +  \  +  2) 

2c-k+i(2k--i  _2fc_/  +  fc_3) 

-2c~k+'  (2k~1  +1  +  3- k)  <  0; 


f(m+  1 )-  f(m) 


2C  -  22k+‘~k-'i  {2k  +  l  -  k  -  1  +  2) 
2c-k'\2k+'  -  2k  -  l  +  k  —  1) 
2c-k~\2k  -  1  -1+  k)  >  0; 


therefore,  p  >  2k  +  l  -  k. 
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Table  1.  Optimal  selection  of  m’s  under  given  c’s,  1  <  c  <  32 


c 

.  P 

c, 

P 

c,  p 

c, 

P 

1 

0 

9 

6,7 

17 

14 

25 

21 

2 

1 

10 

7 

18 

14,15 

26 

22 

3 

2 

11 

8 

19 

15 

27 

23 

4 

2,3 

12 

9 

20 

16 

28 

24 

5 

3 

13 

10 

21 

17 

29 

25 

6 

4 

14 

11 

22 

18 

30 

26 

7 

5 

15 

12 

23 

19 

31 

27 

8 

6 

16 

13 

24 

20 

32 

28 

Table  2.  The  number  of  links  in  optimal  CCCubes  and  in  compatible  hypercubes 


cufcei locccubt 

c 

s  ^cubei  locccube 

1 

1 

1 

9 

960 

2304 

2 

3 

4 

10 

1984 

5120 

3 

8 

12 

11 

4096 

11264 

4 

20 

32 

12 

8448 

24576 

5 

44 

80 

13 

17408 

53248 

6 

96 

192 

14 

35840 

114688 

7 

208 

448 

15 

73728 

245760 

8 

448 

1024 

16 

151552 

524288 

With  the  above  determined  range  of  p,  let  us  examine  the  case  where  m  =  2k  +  l  -  A', 


/(m+l)-/(m)  =  2C  -  22‘+,-*(2*  +  /  -  k  +  2) 

-  -2 c'k(l  -*  +  *>) 

Therefore,  when  /  -  fc  +  2  <  0,  p  =  2k  +  t  -  k  +  1;  and  when  l  -  k  +  2  >  0,  p  =  2k  +  l  -  k.  □ 

Table  1  shows  those  p’s  under  given  c’s,  with  c  ranging  from  1  to  32.  Table  2  compares  optimal 
CCCubes  with  compatible  hypercubes  in  terms  of  number  of  links  used,  where  c  stands  for  the 
dimension  of  hypercubes,  l0CCcube  for  the  number  of  links  in  optimal  CCCubes,  and  /cn ^  for  the 
number  of  links  in  hypercubes.  Figure  3  shows  the  optimal  CCCube  structure  with  c  ranging  from 
1  to  5. 

Figure  4  shows  the  comparison  between  the  standard  hypercube  and  the  optimal  CCCube  in 
terms  of  link  usage  in  the  embedding  of  binomial  trees,  which  is  measured  by  the  number  of  edges 
in  a  binomial  tree  divided  by  the  total  number  of  edges  in  hypercubes  or  optimal  CCCubes. 
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(2,2)-CCCube 


(3,2)-CCCube 


Figure  3.  Optimal  CCCubes 


Figure  4.  Link  usage  in  standard  hypercubes  and  optimal  CCCubes 

4  Execution  of  Parallel  Algorithms  on  Optimal  CCCubes 

The  most  conventional  graph  representations  of  parallel- and- conquer  algorithms  are  tTees,  such  as 
binary  trees  and  binomial  trees.  Divided-and-conquer  algorithms  normally  involve  three  steps[9]: 
broadcasting,  computation,  and  aggregation.  The  broadcasting  phase  distributes  load  to  different 
nodes  from  one  or  more  I/O  nodes  which  has  I/O  function.  The  load  should  be  evenly  allocated  to 
all  nodes  to  reduce  total  execution  time.  The  computation  phase  performs  the  computation  required 
by  each  subproblem.  The  aggregation  phase  is  normally  a  reverse  procedure  of  broadcasting,  and 
represents  a  collection  process  of  results. 

We  study  a  computation  structure  based  on  a  semi-binomial  tree  to  implement  parallel  divide- 
and-conquer  algorithms.  In  a  seioj- binomial  tree,  every  node  in  the  second  level  of  the  tree  is  the 
root  node  of  a  binomial  tree.  Figure  5  shows  a  semi- binomial  tree  with  two  second  level  nodes  each 
of  which  is  the  root  node  of  a  S3.  In  a  CCCube  structure,  if  we  use  the  host  as  the  root  node  of  a 
semi-binomial  tree  and  each  I/O  node  (normally  a  port  node)  as  the  node  at  the  second  level  of  the 
tree,  we  can  easily  construct  a  spanning  semi-binomial  tree.  For  example,  when  both  port  nodes  in 
the  optimal  (l,3)-CCCube  are  I/O  nodes,  the  semi-binomial  tree  in  Figure  o  is  the  corresponding 
spanning  tree. 

The  outline  of  a  parallel  divide-and-conquer  algorithms  based  on  the  semi-binomial  tree  struc¬ 
ture  is  as  follows: 

1.  Give  the  host  the  problem  to  be  solved. 
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Host 


Figure  5.  A  semi- binomial  tree 

2.  The  host  divides  the  problem  into  m  subproblems  and  assigns  each  to  a  distinct  I/O  node  in 
a  CCCube.  Normally  m  is  the  number  of  I/O  nodes. 

3.  Each  I/O  node  (the  root  node  of  a  binomial  tree)  divides  the  subproblem  in  half  and  passes 
the  first  half  to  the  child  which  has  the  most  descendants  and  has  not  yet  received  work.  The 
same  process  is  applied  to  the  second  half,  until  all  children  receive  work. 

4.  Every  node  performs  the  required  work  associated  with  each  subproblem. 

5.  The  results  are  passed  back  to  each  I/O  node,  and  are  merged  in  the  reverse  order  when 
subproblems  are  passed  down  the  tree. 

6.  The  host  collects  results  from  each  I/O  node. 

Note  that  in  the  above  scheme,  step  1  to  step  3  corresponds  to  the  broadcasting  phase.  Step 
4  is  the  computation  phase  where  every  node  computes  at  the  same  step.  Steps  5  and  6  are  the 
aggregation  phase.  To  prevent  potential  bottleneck  at  the  host,  computations  at  step  1  and  step  6 
should  be  relatively  light. 

We  use  the  merge  sorting  algorithm  to  illustrate  the  proposed  approach.  Suppose  a  list  of  32 
elements  (3,2,  12,7,5, 1,  13,45,23.43,8,0. 1 1,34,  15, 16.4,9,25.30.21,31.54,  78.80, 93. 63. 64. 
29,20,10,41)  is  to  be  sorted  in  the  optimal  (l,3)-CCCube  with  two  I/O  nodes:  //O'"  and 
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(  3.  2.  12.  7.  5.  1.  13.  45.  23.  43.  8.  0.  11.  34.  15.  16  ) 


(23.  43,  8.  0.  11.  34,  15.  16) 


Figure  6.  Broadcasting  phase  of  merge  sorting 

I /0^\ see  Figure  3).  First,  the  host  divides  the  list  into  two  sublists  of  length  16.  Suppose  //O*1' 
receives  sublist  (3,2, 12,7,5, 1, 13,45,23,43,8,0, 11,34, 15, 16).  The  sorting  process  of  a  sublist  as¬ 
signed  to  7/0^1  is  demonstrated  in  Figures  6  and  7.  Figure  6  shows  the  broadcasting  process. 
At  the  computation  step  every  node,  including  I/0^\  performs  a  swap  operation  of  two  elements 
if  necessary.  The  aggregation  phase  (Figure  7)  resembles  the  broadcasting  phase,  but  the  mes¬ 
sage  is  distributed  in  the  reverse  order.  At  the  end  of  the  aggregation  phase,  the  I/O W  has  the 
sorted  sublist  (0,1,2,3,5,7,8,11,12,13,15,16,23,34,43,45).  Similarly,  I/O ^  has  the  sorted  sub¬ 
list  (4,9, 10,20,21,25,29,30,31,41,54,63,64,78,89,93).  Finally,  the  host  collects  and  merges  these 
two  sorted  sublists. 

The  proposed  parallel  divide-and-conquer  algorithms  can  be  implemented  in  regular  CCCubes 
and  hypercubes.  Since  there  is  no  performance  degradation  when  they  are  implemented  in  the 
CCCubes  which  use  the  fewest  number  of  links,  the  optimal  CCCube  is  a  cost-effective  structure 
for  implementing  this  class  of  algorithms. 
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( 0.  1  ,  2.  3.  5.  7,  8, 11, 12.  13.  IS.  16.  23.  34.  43.  45) 


(0.  8.  11.  IS.  16.  23.34.43) 


Figure  7.  Aggregation  phase  of  merge  sorting 

5  Conclusions 

This  paper  explored  in  detail  some  properties  of  the  Cube-Connected  Cube  (CCCube)  structure, 
a  variant  of  the  hypercube  structure,  with  each  node  replaced  by  a  cube.  We  considered  first  the 
embedding  of  binomial  tree,  a  useful  structure  for  divide-and-conquer  types  of  parallel  algorithms, 
into  a  CCCube.  It  was  proved  that  an  i-level  binomial  tree  can  be  embedded  into  any  (m,  n)- 
CCCube,  where  m  is  the  dimension  of  outer  cube  and  n  is  the  dimension  of  the  inner  cube, 
provided  that  m  +  n  >  i.  With  the  objective  of  embedding  a  binomial  tree  into  a  CCCube  with 
a  minimum  number  of  links,  the  selection  of  an  optimal  (m,  n)-CCCube  under  a  given  constant 
c  =  m  +  n  was  provided  in  this  paper.  Comparison  was  also  made  between  an  (m,  n)-CCCube 
with  a  c-dimensional  hypercube  in  terms  of  the  link  usage  in  the  embedding  of  a  c-level  binomial 
tree.  A  class  of  parallel  divide-and-conquer  algorithm  was  proposed  based  on  a  semi-binomial  tree 
structure.  It  was  shown  that  optimal  CCCube  is  a  cost-effective  structure  to  implement  such  class 
of  algorithms. 
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